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Abstract — The inference of gene predictors in the gene 
regulatory network has become an important research area 
in the genomics and medical disciplines. Accurate predi- 
cators are necessary for constructing the GRN model and 
to enable targeted biological experiments that attempt to 
confirm or control the regulation process. In this paper, 
we implement a SAT-based algorithm to determine the 
gene predictor set from steady state gene expression data 
(attractor states). Using the attractor states as input, the 
states are ordered into attractor cycles. For each attrac- 
tor cycle ordering, all possible predictors are enumerated 
and a CNF expression is formulated which encodes these 
predictors and their biological constraints. Each CNF is 
explored using a SAT solver to find candidate predictor 
sets. Statistical analysis of the results selects the most likely 
predictor set of the GRN corresponding to the attractor 
data. We demonstrate our algorithm on attractor state data 
from a melanoma study [1] and present our predictor set 
results. 



I. Introduction 

With the mapping of the human genome com- 
plete, the focus in computational biology has shifted 
from sequence analysis to the understanding of gene 
regulation and its inter-relation with the biological 
system. The use of genome information has given rise 
to the notion of "personalized medicine" - targeted 
and specific disease prevention and treatment based 
on individual gene information [2], [3]. The urgent 
applications to cancer and gene-related diseases calls 
for the genomics field to significantly improve the 
algorithms used for accurate inference of the gene 
regulatory network (GRN). 

In an organism, the genome is a highly complex 
control system wherein proteins and RNA produced 
by genes interact with and regulate the activity of 
other genes [4]. The activity of a target gene gi is 
regulated (or predicted) by the genes in its predictor 
(e.g. if gi becomes inactive when g 2 and g 3 are active, 



then g 2 and g 3 are called predictors of g). The com- 
plete set of predictors (predictor set), which contains 
the predictors for each gene in the GRN, describes 
the interaction of all genes within the gene regulatory 
network and is the prerequisite for inferring the GRN 
structure. 

There are several GRN characteristics that impact 
the formulation of our GRN model and predictor 
inference algorithm. First, the gene activity level of all 
genes at a particular time t represents the state of the 
GRN at that time t. From our knowledge of biological 
systems, we observe that over time, cellular processes 
transition to stable attractor states. Some of these 
attractor states represent normal cellular phenomena 
in biology such as cell cycle and division. However, 
some attractor states are consistent with disease such 
as the metastasis of cancer. Second, the GRN is often 
inferred by observing microarray-based experimental 
data which measures the activity level of genes. The 
correlation of the observed gene activity (or state) 
can be used to help describe the gene regulation. The 
disadvantage of using microarray data is such that 
studies do not involve controlled time experimental 
data (time-series data). Hence the measurements are 
assumed to arise from the cyclic sequence of gene 
expressions (attractor states) in steady state (attractor 
cycles). The GRN is then inferred from this data, 
using methods traditionally based on probabilistic 
transition models [5], [6]. 

As previously mentioned, it is necessary to deter- 
mine the predictor set to reconstruct the GRN. How- 
ever, there may exist many possible predictors for any 
gene, based on the attractor cycle data. Furthermore, 
only certain combinations of predictors may form a 
valid predictor set due to biological constraints. The 
issue addressed in this paper is how to efficiently 
and deterministically select the predictors that form 
the predictor set. We have implemented a Boolean 



satisfiability (SAT) based algorithm for the inference 
of gene predictors. Satisfiability is a decision problem 
of determining whether the variables in a Boolean 
formula (expressed in Conjunctive Normal Form or 
CNF) can be assigned to make the formula evaluate 
to true. Although SAT is NP-complete, many SAT 
solvers have been developed to quickly and efficiently 
solve large SAT problems. Our algorithm takes ad- 
vantage of advanced SAT solvers to find the predictor 
set. 

The basic outline of our SAT-based algorithm is 
described briefly below. First, all possible orderings 
of attractor state are enumerated, yielding all possible 
attractor cycles. For each ordering, we enumerate all 
predictors that are logically valid, and create a CNF 
expression which encodes all these predictors and 
biological constraints (such as cardinality bounds on 
the predictors). A SAT solver is used to find the valid 
candidate predictor sets. After this process is done 
iteratively for all attractor cycle (orderings), statistical 
analysis provides the most likely candidates for the 
predictor set. 

The key contributions of this paper are: 

• We develop a Boolean Satisfiability based ap- 
proach to realize the gene predictor set from 
attractor state data. 

• We modify an existing SAT-solver (MiniSat [7]) 
for efficient all-SAT computation and further op- 
timize MiniSat for improved predictor inference. 

• On gene expression data from a melanoma 
study [1], we apply our SAT-based algorithm 
and present results for genes that regulate all the 
genes, including the cancer gene WNT5a. 

• Our approach can be used to find the predic- 
tor set for any gene related disease, provided 
attractor state data is available. The predictor 
set information obtained from our algorithm can 
be used by biologists to fine tune their gene 
expression experiments. 

The remainder of this paper is organized as follows. 
Section [XT] describes previous work in modeling the 
gene regulatory network and inference of gene predic- 
tors. Section Hn] presents our FSM model and Boolean 
SAT approach. Section [IV] reports experimental re- 
sults. Concluding comments and future work are 
discussed in Section |V] 

II. Previous Work 

Several models have been proposed for modeling 
the GRN such as Markov Chains [8], [9], Coupled 



ODEs (ordinary differential equations), Boolean Net- 
works [10], [11], Continuous Networks [12], and 
Stochastic Gene Networks [13]. 

This paper utilizes the Boolean Network (BN) 
model that was proposed by Kauffman in 1969 [10]. 
In a Boolean Network, the expression activity of 
a gene is represented as a binary value, where 1 
indicates the gene is ON (active) and producing gene- 
products, while indicates it is OFF. Such a model 
cannot capture the continuous and stochastic bio- 
chemical properties of protein and RNA production. 
However, genes can typically be modeled as ON or 
OFF in any particular biochemical pathway. 

In [14], [15], the probabilistic modeling framework 
is represented by dynamic Bayesian networks and 
probabilistic Boolean networks (PBNs). The method 
proposed considers gene prediction using multinomial 
probit regression with Bayesian variable selection. 
Genes are selected which satisfy multiple regression 
equations, of which the strongest genes are used 
to construct the predictor set. The target gene is 
predicted based on the strongest genes, using the 
coefficient of determination to measure predictor ac- 
curacy. 

Another method proposed by [16] also assumes 
PBN. A partial state transition table is constructed 
based on available attractor state data. From this state 
transition table, predictors with 3 or less regulating 
genes are selected for each target gene. All unknown 
values in the table are randomly set. The Boolean 
network is simulated for several iterations on several 
starting states, observing whether the states eventually 
transition to an attractor cycle. If the simulation suc- 
cessfully transitions to attractor cycles, the selected 
predictors are considered as a valid predictor set. This 
process is repeated to build a collection of Boolean 
Networks which are combined to form a Probabilistic 
Boolean Network (PBN). 

Our larger goal is to find a small number of 
deterministic GRNs, rather than a PBN. Towards this, 
we need to find ways to accurately find the predictor 
set. This is the focus of this paper. Philosophically, 
our aim is to invest effort into accurate predictor set 
determination, so that the results can be used to find 
high quality deterministic GRNs. 

III. Our Approach 

This section describes our model and algorithm for 
inference of predictor sets using SAT. We begin with 
some logic synthesis definitions which are useful in 



understanding the application of SAT to GRNs and 
predictor selection. We then describe a simple ex- 
ample to explain the algorithm. Lastly, we generalize 
the algorithm for larger problem sets and comment on 
specific issues about the use of SAT and complexity. 

A. Definitions 

Definition 1: A literal or a literal function is a 

binary variable x or its negation x. 

Definition 2: A cube is a product of a set of literal 
functions. 

Definition 3: A clause is a disjunction containing 
literals. 

Definition 4: A Conjunctive Normal Form 
(CNF) expression consists of a conjuction (AND) 
of m clauses c\ . . . c m . Each clause q consists of 
disjunction (OR) of k number of literals. 

A CNF formula is also referred to as a logical 
product of sums. Thus, to satisfy the formula, each 
clause must have at least one literal evaluate to true. 

Definition 5: Boolean satisfiability (SAT). Given 
a Boolean formula S on a set of binary variables 
X , expressed in Conjunctive Normal Form (CNF), 
the objective of SAT is to identify an assignment of 
the binary variables in X that satisfies S, if such an 
assignment exists. 

For example, consider the formula S(a, b, c) = (a + 
b) ■ (a + b + c). This formula consists of 3 variables, 2 
clauses, and 4 literals. This particular formula is sat- 
isfiable, and a satisfying assignment is < a,b,c > = 
< 0,0,1 >, which can be expressed as the satisfying 
cube abc. 

There may exist many satisfying assignments for 
the formula in question. An extension of the SAT 
problem is to find all satisfying assignments (or All- 
SAT). One simple method to accomplish All-SAT is 
to repeatedly run SAT on the formula S, express each 
satisfying assignment as a cube k, complement k to 
get a clause c, and add c as a new clause of the 
formula and running SAT again. The inclusion of c 
in S ensures that the same cube k cannot be found as 
a satisfying assignment again. The process continues 
until no new solutions can be found. In the previous 
example, the satisfying cube abc is complemented and 
added as a new clause (a + b + c) to the original 
formula to be solved by SAT again (this is repeated 
until no new satisfying assignments are found). 

Definition 6: A predictor /j = {gj, g k , ■ ■ •} lists 
the set {gj,gk, • • } of genes which regulate the ac- 
tivity of gene <&. 



Definition 7: The predictor set is the complete 
set of predictors {/ 1; f 2 , ■ - ■ , f n } for the GRN with 
n genes gi,g 2 ,- ■ • ,g n . 

B. Implementation and Example 

Given gene expression data (a set of attractor 
states) as input, we would like to determine the best 
predictor set. We first present an outline of our SAT- 
based algorithm, and then explain the steps through 
a simple example. 

The algorithm has three main steps. 

• First, attractor states are ordered into attractor 
cycles. For each possible ordering of the attractor 
states in to attractor cycles, all possible predic- 
tors are found and a CNF is generated containing 
the predictors and constraints. 

• Second, the CNF is solved for All-SAT, record- 
ing all satisfying cubes. Each cube corresponds 
to a predictor set. The first two steps are repeated 
for all attractor cycle orderings. 

• Finally, statistical analysis on the SAT results 
determines the most frequent (likely) predictor 
set for the GRN. 

We apply the SAT-based algorithm to a simple 
example with three genes (#i, #2) #3) an d gene ex- 
pression data with two lines (000, 101). The present 
state of these genes is represented by the variables 
< x\,x 2l x?, > and the next state is represented by 
the variables < y±, y 2 , 2/3 >. We assume each line was 
measured in steady state and therefore is an attractor 
state. 

Step 1: We order (or arrange) the attractor states 
into valid attractor cycles, of which there are two 
possibilities. One ordering is with each attractor state 
transitioning to itself with a self-edge, thus resulting 
in an attractor cycle of length one, as shown in 
Table H The other possible ordering is a transition 
from one state to the other and back, forming a single 
attractor cycle of length two. 

For each valid attractor cycle ordering, a partial 
state transition table is constructed containing the 
attractor states. For example, if the first attractor cycle 
ordering (in which states transition to themselves) is 
chosen, the resulting state transition table is shown in 
Table IB To find all valid predictors of a gene, each 
next state column is checked against all combinations 
of the present state columns. For example with gene 
<7i, the next state bit j/i is in the first row and 1 
in the second row. Hence the present state bit x 2 
alone cannot predict y 1 as there is a contradiction 
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Example state transition table 



(since y\ — from the first row and y\ — 1 in the 
second row and x 2 = for both rows). However 
if we consider state bits x 2 and £3, we find that 
they together can predict y ls since the pair of values 
< £2, £3 > i s different in the two rows. Thus gene 
gi can be regulated by genes g 2 and g 3 , so one valid 
predictor for g\ is f\ = {x 2 ,x 3 }. All valid predic- 
tors with 3 or less inputs are exhaustively searched 
and recorded for CNF formulation in the next step. 
In our example, gene g\ has 5 possible predic- 
tors {x 3 }, {x 1 ,x 2 }, {xi,x 3 }, {x 2 ,x 3 }, {xi,x 2 ,x 3 } 
which we label 44 v\,v\ respectively. We assume 
that a gene cannot self-regulate, so {xi} is not a valid 
predictor. 

Step 2: After all predictors are found, we generate 
the SAT formula which encodes valid predictor sets 
for all possible predictor combinations. Each predic- 
tor is assigned a variable v* which corresponds to 
the j th predictor for gene i. Gene g\ in our example 
will have five predictor variables v\ = {£3}, v\ = 
{xi,x 2 }, v\ = {x 1 ,x 3 },vl = {x 2 ,x 3 }, v\ = 
{xi, X2, x 3 }. Gene g 2 will have the predictor variables 



{xi,x 2 }, v\ = {xi,x 3 }, vl = {x 2 ,x 3 }, V. 



{xi, x 2 , x 3 }. Gene g 3 will have the predictor variables 
v i = i x i}i 4 = {xi,x 2 },i% = {xi,x 3 }, v\ = 
{x 2 ,x 3 }, v\ = {xi,x 2 ,x 3 }. There are three con- 
straints that we incorporate while constructing the 
CNF. The conjuction of these constraints our final 
CNF. 

1) The first constraint (Si) is that all genes in the 
GRN must have a predictor. In other words, 
we assume that all genes are "participating" 
in the GRN and that all genes predict at least 
one other gene. For gene i, all of its associated 
predictor variables are written in a single clause 

c \ = ( v \ H 1" v j)- The clause for gene g\ in 

our example is formulated as c\ = (v\ + v\ + 
v\ + v\ + v\). To satisfy this satisfy this clause, 
at least one predictor among v\, - ■ ■ , v\ must be 
chosen. Then to ensure at least one predictor is 
chosen for all genes we write the conjunction 
of all c] clauses. 



Si 



C l ' C 2 ' C 3 



Si = (v\ + v\ + v\ + v\ + v\) ■ (v\ + v\ + v\ + 
v\) ■ (v\ + vl + vl + v\ + vfj 
2) The second constraint (S 2 ) specifies that for 
each gene, there exists only one predictor. A 
gene cannot be regulated by multiple sets of 
predictors. To formulate the clauses cf for gene 
i, smaller clauses are formed from all pair 
combinations of its predictors v\...j. In each of 
these clauses of pair of variables, both predictor 
variables are complemented. For gene g\, (? x = 

@+3) • k+3) • (4+3) • S.+3) • (4+ 4) • 

(vl+vD^vl + vD-^l + vD-^ + vD^vl + vl) 
Any selection of two or more predictors for 
gene 1 will result in a unsatisfiable solution. 
Because the c\ clause ensures at least one 
predictor will be chosen, cf forces our selection 
to choose at most one predictor gene i. Then 
constraint S 2 includes cj t so that at most one 
predictor is chose for each gene. 



So 



3' 



where 



4_= (4 + 4) • (4 + 4) • (4 +4) - (4 +4) ■ 

(vl + vl) ■ (vl + vl) ■ (vl + vl) ■ (vl + vl) ■ (vl + 
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4 



H 



4 



(4 + V D ■ {vl + vl) ■ (vl + vl) ■ (v 2 2 + v\ 



(v 2 2 + vl) ■ (vl + vl) 



c\= (4 + 4) ■ (4 + 4) ■ (4 + 4) ■ (4 + 4) ■ 
14 + 4) ■ (4 + v ±) ■ (4 + vl) ■ (vi + vi) ■ (vi + 

vf) ■ (vl + vl) 
3) The last constraint (S3) requires that each genes 
must be used as a predictor for at least one 
other gene in the satisfying predictor set. A 
gene that is not used in any predictor does 
perform any regulation function and could be 
removed from the GRN. S3 ensures that this 
does not occur. To ensure that gene ^ is used 
in at least one other predictor, we form clauses 
cf which includes all predictors that use gene gi 
as input. To specify that gene $ must be used, 
we also include a single variable clause (xj) to 



c? and add an additional literal x, to the other 



other clauses in cf. The clause (xj) requires 
our solution to include gene g^ and the xl in 
the other clauses of cf forces at least one other 
predictor variable in cf be selected to satisfy the 
formula. For example, the S 3 clauses for gene 



gi are c\ = (x±) ■ (xi + 



vl 



vl + v I + v I + 



4 + v i + v 2 + 4 + 4)- To sa tisfy these clauses, 
Xi and at least one other predictor variable in 
cf must be selected. Again, S 3 includes cf for 



all genes, so: 




£3 = c\ ■ (? 2 ■ c\, where 


10% 


C l = ( x l) • (Zl + V 2 + V l + V l + V 1 + V 2 + V l + 


£ m 


vf + vl + v\ + vf) 
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All-SAT Average Predictor Error with Random Variable Selection 



C 2 = (^2) ■ (xT + vl + vl + vl + vl + vl + vj + 

vl + v\ + vf) 

c 3 = ( x s) • {Xl + v\ + v\ + v\ + v\ + W2 + V3 + 
V 4 + U 3 + V \ + U I) 

Finally we create the SAT formula S as a 
conjunction of the Si formulas. 

S = 61 • 02 ■ 03 

Step 3: Constraints together form the CNF S on 
which the SAT solver performs an All-SAT The 
cubes (each cube encodes a candidate predictor set) 
from the All-SAT are collected and the process 
repeats for the remaining attractor cycle ordering s. 
From the results, we find the most likely predictors 
based on the frequency of occurrence of the predic- 
tors across all orderings. Three methods are used to 
analyze the statistical results, which will be described 
in Section JVl 

In general, the algorithm can be applied to input 
data for N genes and A attractor states. The total 
number of attractor state orderings is A\. For each 
ordering, there can be up to 0(N 3 ) predictors per 
gene. Then SAT search space per ordering is on the 
order of 0(2 N ) resulting in overall complexity of 
0(A\2 N ). Typically, the number of attractor states A 
recorded through gene expression measurements are 
small. As such, A\ is thus much smaller than 2^ \ so 
the runtime complexity is dominated by the All-SAT 
operation. For pragmatic reasons, our algorithm stops 
each All-SAT after T minutes (or C cubes), where T 
or C is defined by the user. 

The SAT solver used in our algorithm is based 
on MiniSat [7]. We modify MiniSat to perform 
All-SAT optimized for predictor inference with two 
main changes. First, we loop the SAT solving pro- 
cess internal to MiniSat automatically complementing 
satisfying assignments (cubes) and appending the 
resulting clause to the CNF. Second, we modify 
MiniSat to randomly select branch-variables during 
the solving process. Because MiniSat is originally 
designed for finding a single satisfying assignment, 
MiniSAT uses a decision heuristic for determining 
variables of the final solution. However, this heuristic 
will result in many of the same variables being 
chosen over iterative runs of MiniSat. To increase 
the activity of all variables, we change the random 
variable frequency of MiniSat to 100% (from 2% 
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Fig. 1. Average predictor error difference on melanoma attractor data 
using MiniSat with random variable selection modification 
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Fig. 2. Average predictor error difference on melanoma attractor data 
using MiniSat without modification 



in the unaltered MiniSat code) to force MiniSAT to 
always choose a random variable on every variable- 
branch decision. A random variable freqeuncy of f% 
means that MiniSat selects the next variable randomly 
f% of the time. 

To confirm the quality of predictor selection of 
our modified All-SAT, our algorithm was run on 
four selected attractor cycle orderings (labeled 10, 
721, 744, and 849) using melanoma data from [1], 
allowing the All-SAT operation to run for 12 hours or 
until all cubes were found, whichever was first. In the 
case of attractor cycle order 721, all cubes were found 
under 12 hours. We assume that 12 hours of runtime 
produce predictor results closely identical to a com- 
plete All-SAT. In Figure \T\ and Figure [2l we compare 
the average difference in predictors frequency of 
the 12 hour (or complete All-SAT) results with the 
results obtained with shorter All-SAT runtimes (of 
10, 30, 60, and 120 minutes). Figure Q] shows the 
average error difference of all predictors for the four 
orderings using MiniSat with the random variable 
selection modification (100% random variable fre- 
quency), while Figure [2] shows the same MiniSat 
results without random variable selection (2% random 



variable frequency). For example, with attractor cycle 
order 721, predictor fi = {#3, #5, #7} had a frequency 
of occurrence of 50% with a 12 hour runtime. Using 
random variable selection and a 30 minute runtime, 
the same predictor had 43% occurrence, resulting in a 
difference of 7%. Without random variable selection, 
the predictor had a 69% occurrence, a 19% error. 
Across the four orderings analyzed, the average error 
difference over all predictors (shown in Figures Q] 
and |2]) is significantly lower using the random vari- 
able selection modification than without. At 120 
minutes, the random variable selection method has a 
predictor occurrence that differs from true All-SAT 
by about 3%, while without random modification, 
the difference is about 8%. From this experiment, 
we determine that 30 minutes with random variable 
selection was sufficient to achieve an average of < 
5% difference from the true All-SAT results. 

IV. Experimental Results 

To evaluate our SAT-based algorithm for inferring 
gene predictors, the algorithm was tested on gene- 
expression data from a melanoma study done by 
Bittner and Weeraratna [1]. In the melanoma study, it 
was observed that an abundance of RNA (expression) 
for gene WNT5A was associated with a high metas- 
tatis of melanoma. The study measured 587 genes 
with 31 gene expression patterns (lines). Seven genes 
are believed to be closely knit: PIRIN, S100P, RET1, 
MARTI, HADHB, STC2, and WNT5A. There are 18 
distinct patterns, which were reduced to seven using 
Hamming-distance of one in Table [HI These seven 
lines form the attractor states which are the input to 
our algorithm. 

Our algorithm utilizes a modified open-source and 
highly efficient exact SAT-solver called MiniSAT 
vl.14 [17], [7]. All-SAT operations were limited to 
30 minute time-out. On average, each All-SAT run 
yielded 10K satisifying cubes. Our algorithm was 
implemented and run on a Pentium 4 Linux machine 
with 4GB RAM. 

For the experiments, we assume two additional re- 
strictions on attractor cycle orderings. First, we divide 
attractor states into good and bad states based on the 
presence of WNT5A. We allow good attractor states 
to cycle only to other good attractor states, and bad 
attractor states can only cycle to other bad attractor 
states. Second, we limit the attractor cycle length to 3 
or less since long attractor cycles are highly complex 
and unlikely to occur in most biological systems. 



In Figure [3j we display a histogram of all valid 
predictors and their frequency of occurrence over all 
attractor orderings. In this chart and table of results, 
a predictor label of 2367 means that gene §2 is 
predicted by genes g-3,ge, and g-j. From this chart, 
we can observe that certain predictors occur with 
significantly higher frequency than others. For exam- 
ple with gene g\, the predictor {xs,x 5 ,x 7 } (PIRIN 
predicted by RET1, HADHB, WNT5A) occurs with 
much higher frequency than all other predictors for 
gene g 1 . This indicates that this predictor is most 
likely to be present in the final predictor set. 

From this data, we propose two methods (A and 
B) for selecting the predictor set. In method A, a 
predictor histogram is created as in Figure |3] From 
the histogram, for each gene g-i, we find its predictor 
fij such that: 

• p l j is the most frequently occurring predictor of 
gene g % . 

• The resolution ratio Ri of this predictor (de- 
fined as the ratio of the occurrence frequency 
of p l j to the occurrence frequency of the next 
most frequently ocurring predictor of gene g^) is 
maximum. 

Among all genes, we choose the one with the 
highest resolution ratio, and select its most frequently 
occuring predictor as its final predictor. After se- 
lecting this final predictor, regenerate the histogram, 
discarding any cubes that do not contain the final 
predictor(s) that have been selected in previous steps. 
The process repeats until all genes have a single final 
predictor. The set of final predictors of all genes forms 
the predictor set. The advantage of method A is that 
at every iteration, we select real predictors that have a 
high overall occurrence in the solution. However the 
method may have problems selecting final predictors 
if the resolution ratio is low (i.e. when the frequencies 
of occurrence of the predictors are nearly identical). 

As an alternative, method B is proposed to deter- 
mine for each gene i, how likely it is that gene gi will 
predict the other genes in the GRN. In other words, 
we ask what is the occurrence frequency of Xi in 
the predictors of fj. Table ITTll shows how frequently 
a gene g^ is used to predict a gene gj. This table 
is populated by summing the occurrence frequency 
of all predictors of gj that have gene gi as one if 
their inputs. As such, any entry can be >100, and 
is a measure of the usefulness of g t as a predictor 
for gj. This is done by finding, for each column j 
of Table [Till the three largest entries and adding their 
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Fig. 3. Method A: Predictor occurrence for all valid attractor cycle orderings (first iteration: no predictor selected) 
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TABLE II 

Attractors for Melanoma Network 



TABLE III 
Method B: Gene occurrence for all predictors (first 

ITERATION: NO PREDICTOR SELECTED) 



values. Suppose we call this value Sj or the resolution 
score of column j. We compute the resolution score 
for all columns and find the final predictor the colum 
with the highest resolution score. This is done by 
listing the 3 input genes that correspond to the 3 
entries that were used to compute the highest res- 
olution score. Similar to method A, we reiterate the 
process by regenerating the table after discarding all 
cubes that do not contain predictors that were selected 
in previous steps. Method B has the advantage of 
being more robust when no single predictor has a 
significantly higher occurrence frequency than others. 
However, there is no guarantee that the predictor 
selected by method B is a valid predictor. 

In our experiments, we also use a hybrid method 
AB which works in the following manner. Both meth- 
ods A and B are used to select their best predictor. 
If both methods produce the same predictor f it we 
select this predictor as a final predictor. If not, we list 
the best predictors for each gene for both methods. 
If multiple predictors match for both methods, we 
choose the final predictor as the one with the highest 



weighted sum of the resolution ratio and resolution 
score. The resolution ratio is weighted by 0.3 and the 
resolution score is weighted by 0.7. The weighting 
factor for resolution ratio is lower since the resolution 
ratio values of any gene are often close to 1 . In such 
a situation, we would like to favor method B. If no 
predictor is produced by the previous step, we look at 
the top five predictors of method A for each gene and 
calculate the weighted sum of their resolution ratio 
and resolution score. The predictor with the highest 
weighted sum is selected as the final predictor. The 
process is reiterated, regenerating the histogram and 
table at each step, by discarding any cubes that do not 
contain any of the previously selected final predictors. 
With this combined approach, we are able to select 
predictors with a higher degree of confidence and 
robustness. 

We ran our experiments on the melanoma attractor 
data of [1] using methods A, B, and AB. Results 
are shown in Tables [TV] |Vj and |Vl] respectively. 
Each table shows what predictor was selected at each 





Iteration 


1 


2 


3 


4 


5 


6 


7 


Predictor selected 


1357 


6124 


3146 


7124 


5124 


4357 


2137 


Resolution ratio 


2.57 


1.66 


1.34 


1.31 


1.41 


1.30 


1.41 



TABLE IV 
Predictor set selection using method A 





Iteration 


1 


2 


3 


4 


5 


6 


7 


Predictor selected 


1357 


3146 


4137 


7124 


2367 


6357 


5137 


Resolution ratio (A) 


2.57 


1.07 


1.11 


1.57 


1.28 


1.77 


1.01 


Resolution score (B) 


1.85 


2.04 


1.83 


2.01 


1.70 


1.23 


1.69 



TABLE VI 
Predictor set selection using combined method A and B 





Iteration 


1 


2 


3 


4 


5 


6 


7 


Predictor selected 


7126 


3146 


5134 


4137 


2137 


1357 


6137 


Resolution score 


2.56 


1.84 


1.99 


1.97 


1.77 


1.78 


1.98 



TABLE V 
Predictor set selection using method B 



iteration and the accompanying resolution ratio (or 
score). Using method A, the predictor set contains 

{A = {3,5,7}, h = {1,3,7}, f 3 = {1,4,6}, 
U = {3,5,7}, h = {1,2,4}, U = {1,2,4}, 
fj = {1,2, 4}}. The predictor set selected by method 
B contains {j\ = {3,5,7}, f 2 = {1,3,7}, f 3 = 
{1,4,6}, U = {1,3,7}, h = {1,3,4}, / 6 = 
{1,3,7}, f 7 = {1,2,6}}. Finally, the predictor set 
determined through combining method A and B re- 
suits in {h = {3, 5, 7}, f 2 = {3, 6, 7}, / 3 = {1,4, 7}, 
U = {1,3,7}, h = {1,3,7}, h = {3,5,7}, f 7 = 
{1,2,4}}. 

From the experiment data, we can draw several 
conclusive results: 

• It should be noted that the final predictor set 
from each method is a valid satisfying cube of 
the SAT formula S. The iterative steps in regen- 
erating the histogram (or table) retain only cubes 
that contain previously selected final predictors. 

• The algorithm enables us to generate a few 
deterministic GRNs. The final predictor set is 
present in a select number of attractor cycle 
orderings. For example, the final predictor set 
selected by methods A, B, and AB are found in 
respectively 8, 4, and 6 attractor cycle orderings 
out of the total 5040 possible orderings. 

• Some predictors are common among the predic- 
tor sets between the three methods. For example, 
all three methods select /i = {<? 3 , g 5 , g 7 } (PIRIN 
predicted by RET1, HADHB, WNT5A). We can 
conclude this predictor is highly likely to be 
a final predictor in the GRN. Also, many that 
are predictors selected by the three methods, 
while different, share common input genes. For 
example, the exact predictor selected by each 



method is different for gene g 2 (SI OOP), but all 
f 2 predictors contain 2 common genes {#3, g 7 } 
(RET1, WNT5A), meaning these 2 genes are 
likely to be contained in the final predictor of 

h- ' 
• Using the above results, biologists can target 
their research on gene regulation and control, 
focusing on the gene relationships determined 
by the predictor set results. 

V. Conclusions 

Determining the predictor set for a gene regulatory 
network is important in many applications, particular 
inference and control of the GRN. In this paper, we 
formulate gene predictor set inference as an instance 
of Boolean satisfiability. In our approach, we deter- 
mine all possible orderings of attractor state data, gen- 
erate the CNF encapsulating predictor and biological 
constraints, and apply a highly-efficient and modified 
SAT solver to find candidate predictor sets. The SAT 
results are analyzed using three selection methods to 
produce the final predictor set. We have tested our 
algorithm on attractor state data from a melanoma 
study, and determined the predictor sets for this GRN. 

Encouraged by these results, we plan to expand our 
SAT-based algorithm to utilize weighted max SAT. 
This would provide a more flexible platform where 
every predictor has an associated weight (or impor- 
tance) in the SAT formulation. The weighted max 
SAT algorithm can be tailored for more restrictive 
biological constraints, and also would allow biologists 
to selectively increase or decrease weights on specific 
predictors. This work will incorporate the predictor 
set results to implement an algorithm for the inference 
of the complete GRN structure. 
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